Fix NHN Cloud title extraction by SmileJune · Pull Request #18 · SmileJune/techcase

SmileJune · 2026-06-01T02:07:00Z

Approved scope

Correct title extraction so that URL-like values in the collected NHN Cloud Meetup data are not stored as the article title.

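Changes

Normalize postPerLang.title from the NHN Cloud post API and use it as the title.
Handle the NHN Cloud Meetup suffix so that it is not duplicated.
Reject title candidates that are http/https URLs or have a domain/path form.
In general sitemap HTML collection as well, skip the article when the only title candidates are URL-like.
Added sitemap crawler regression tests.
Recorded the fix and the verification results in the development log.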
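
The rejection rule in the list above can be sketched roughly as follows. The function names `is_url_like_title()` and `clean_article_title()` come from the review walkthrough, but the bodies and both regular expressions are assumptions for illustration only; the actual rules in `app/crawler/sitemap.py` are not shown in this PR text.

```python
import re
from typing import Optional

# Hypothetical patterns, not taken from the PR.
_SCHEME_RE = re.compile(r"^https?://", re.IGNORECASE)
# Bare "domain/path" shape such as "meetup.nhncloud.com/posts/123":
# no whitespace, a dotted host with an alphabetic TLD, then a slash.
_DOMAIN_PATH_RE = re.compile(
    r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}/\S*$"
)


def is_url_like_title(candidate: str) -> bool:
    """Return True when the candidate looks like a URL rather than a headline."""
    text = candidate.strip()
    return bool(_SCHEME_RE.match(text) or _DOMAIN_PATH_RE.match(text))


def clean_article_title(candidate: Optional[str]) -> Optional[str]:
    """Collapse whitespace and return None for empty or URL-like candidates."""
    if candidate is None:
        return None
    text = " ".join(candidate.split())
    if not text or is_url_like_title(text):
        return None
    return text
```

Returning `None` instead of raising keeps the helper reusable: each caller can decide whether to try the next candidate or skip the article.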

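Intentionally excluded

No database migration.
No direct modification of the production DB. We confirmed that the local and production DBs currently contain 0 NHN Cloud titles in URL form.
No changes to search ranking or Elasticsearch queries.
No auto-merge.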

Verification

uv run pytest tests/test_sitemap_crawler.py -> 4 passed
uv run ruff check app/crawler/sitemap.py tests/conftest.py tests/test_sitemap_crawler.py -> All checks passed
Confirmed that the count of URL-like titles for the nhn-cloud-meetup source is 0 in both the local DB and the production DB.

How to verify manually

Check out the PR branch.
Change to apps/backend.
Run the pytest and ruff commands above.
Check the test in which a URL-like NHN Cloud title is handled as SkippedArticleError.
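
For readers who want to see the shape of such a check without the repository at hand, here is a minimal standalone sketch using the standard library's `unittest`. Everything in it is a stand-in: `extract_title()` is a hypothetical simplification of the NHN Cloud extraction path, the nested-dict payload shape is only assumed from the field name `postPerLang.title`, and the local `SkippedArticleError` class merely mirrors the name of the backend's exception. The real tests in `tests/test_sitemap_crawler.py` run under pytest and are not reproduced here.

```python
import unittest


class SkippedArticleError(Exception):
    """Stand-in for the backend's skip exception."""


def extract_title(payload: dict) -> str:
    # Hypothetical stand-in: read postPerLang.title, collapse whitespace,
    # and refuse empty or http/https-prefixed values.
    title = (payload.get("postPerLang") or {}).get("title") or ""
    title = " ".join(title.split())
    if not title or title.lower().startswith(("http://", "https://")):
        raise SkippedArticleError("missing or URL-like title")
    return title


class NhnTitleTest(unittest.TestCase):
    def test_url_like_title_is_skipped(self):
        payload = {"postPerLang": {"title": "https://meetup.nhncloud.com/posts/1"}}
        with self.assertRaises(SkippedArticleError):
            extract_title(payload)

    def test_normal_title_is_kept(self):
        payload = {"postPerLang": {"title": "Intro to Kafka"}}
        self.assertEqual(extract_title(payload), "Intro to Kafka")
```

Saved as a module, the two cases can be run with `python -m unittest`.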

Summary by CodeRabbit

Bug Fixes
- Improved article title validation to prevent URL-like strings from being stored as titles.
- Articles without valid titles are now properly skipped rather than using fallback URLs.
- NHN Cloud Meetup articles receive properly normalized titles with consistent formatting.

coderabbitai · 2026-06-01T02:07:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d69ea709-4cf4-4a74-8d2b-59d500b07c84

📥 Commits

Reviewing files that changed from the base of the PR and between 3a4bf33 and 46c1417.

📒 Files selected for processing (4)

apps/backend/app/crawler/sitemap.py
apps/backend/tests/conftest.py
apps/backend/tests/test_sitemap_crawler.py
docs/development-log.md

📝 Walkthrough


The PR adds title validation and cleaning logic to the sitemap crawler to prevent URL-like strings from being stored as article titles. New helper functions detect and filter URL-like titles, with NHN Cloud-specific normalization. Both NHN Cloud API extraction and general HTML parsing now raise SkippedArticleError when all title candidates are invalid, rather than falling back to the page URL.
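
The "skip instead of falling back to the page URL" behavior described above might look like the sketch below. `pick_title()` and `looks_like_url()` are hypothetical names with a deliberately simplified URL check, and `SkippedArticleError` is redefined locally as a stand-in; none of this is the PR's actual code.

```python
from typing import Iterable, Optional


class SkippedArticleError(Exception):
    """Stand-in for the crawler's skip signal."""


def looks_like_url(text: str) -> bool:
    # Simplified assumption: a scheme prefix, or no spaces plus a dot and a slash.
    if text.lower().startswith(("http://", "https://")):
        return True
    return " " not in text and "." in text and "/" in text


def pick_title(candidates: Iterable[Optional[str]], page_url: str) -> str:
    """Return the first usable candidate; never fall back to page_url."""
    for candidate in candidates:
        if candidate is None:
            continue
        text = " ".join(candidate.split())  # collapse runs of whitespace
        if text and not looks_like_url(text):
            return text
    # page_url appears only in the error message, never as a title.
    raise SkippedArticleError(f"no usable title for {page_url}")
```

Candidates would be passed in priority order, for example meta tags first and the first heading last, so the earliest clean value wins.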

Changes

Article Title Validation and Cleaning for Sitemap Crawler

Layer / File(s)	Summary
Title validation and normalization helpers `apps/backend/app/crawler/sitemap.py`	Adds constant `NHN_CLOUD_MEETUP_TITLE_SUFFIX`, introduces `is_url_like_title()`, `clean_article_title()`, and `normalize_nhn_cloud_title()` utilities. Updates `first_heading_title()` to use `clean_article_title()` when extracting `<h1>` text.
NHN Cloud API extraction with title validation `apps/backend/app/crawler/sitemap.py`, `apps/backend/tests/test_sitemap_crawler.py`	`extract_nhn_cloud_payload()` now normalizes API-provided titles via `normalize_nhn_cloud_title()` and raises `SkippedArticleError` for missing/URL-like titles. Tests verify correct title extraction and rejection of URL-like candidates.
General article extraction with title validation `apps/backend/app/crawler/sitemap.py`, `apps/backend/tests/test_sitemap_crawler.py`	`extract_article_payload()` tries multiple cleaned title sources (meta tags, document short title, HTML title, first heading) and raises `SkippedArticleError` when all are invalid. Tests confirm the function skips articles when all title candidates are URL-like.
Title cleaning validation unit tests `apps/backend/tests/test_sitemap_crawler.py`	Direct tests for `clean_article_title()` confirm it returns `None` for URL-like inputs and preserves normal text titles.
Test infrastructure and mock helpers `apps/backend/tests/conftest.py`, `apps/backend/tests/test_sitemap_crawler.py`	Adds Python path setup in conftest to enable backend module imports. Defines `nhn_source()` helper and `nhn_api_client()` mock using `httpx.MockTransport` to simulate NHN Cloud API responses.
Development log documentation `docs/development-log.md`	Documents the title validation changes, test commands, and verification results in the development log.
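
One way to keep the NHN Cloud Meetup suffix from being duplicated, as the NHN Cloud row in the table mentions, is to strip every trailing copy before appending it once. The sketch below assumes that the suffix is appended rather than removed and uses a made-up value for `NHN_CLOUD_MEETUP_TITLE_SUFFIX`; the PR names the constant and the function but shows neither the value nor the body. The URL-like rejection step is left out here to keep the example focused.

```python
from typing import Optional

# Assumed value; the real constant's contents are not shown in the PR.
NHN_CLOUD_MEETUP_TITLE_SUFFIX = " : NHN Cloud Meetup"


def normalize_nhn_cloud_title(raw: Optional[str]) -> Optional[str]:
    """Return the title with exactly one suffix, or None when nothing is left."""
    if raw is None:
        return None
    text = " ".join(raw.split())
    suffix = NHN_CLOUD_MEETUP_TITLE_SUFFIX.strip()
    # Strip every trailing copy of the suffix so it is never doubled.
    while text.endswith(suffix):
        text = text[: -len(suffix)].rstrip()
    if not text:
        return None
    return f"{text}{NHN_CLOUD_MEETUP_TITLE_SUFFIX}"
```

Under these assumptions the function is idempotent: feeding its own output back in returns the same string.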

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A crawler's tale, with titles clean,
No URLs where titles should gleam,
NHN suffixes dance with grace,
While URL-like strings find no place,
Tests ensure each heading's true,
And articles skip if none will do!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly summarizes the main change: correcting NHN Cloud title extraction logic to prevent URL-like values from being saved as article titles.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


Fix NHN Cloud title extraction

46c1417

SmileJune marked this pull request as ready for review June 1, 2026 05:31

SmileJune merged commit cf54345 into main Jun 1, 2026
2 checks passed

SmileJune merged 1 commit into main from ai/nhn-title-extraction
